Open Access Evidence in Unpaywall using Python

Analyzing the open access status of scholarly articles is a challenging task that many libraries, universities and researchers are facing. In this blog post, we discuss a Python based approach for analyzing journal articles using the Unpaywall database. In particular, we build upon the results of an existing blog article which used R and provide a new python implementation.

Nick Haupka
2020-02-23

The Unpaywall database is a collection of millions of scholarly fulltext articles that were harvested from a variety of different repositories and services like PubMed Central and DOAJ. Although Unpaywall offers a REST API to query the Unpaywall database, it is more pleasant and efficient to use the provided database snapshots, which are usually released twice a year.

In this blog post, we discuss the results from the blog article “Open Access Evidence in Unpaywall” by Najko Jahn and Anne Hobert by comparing different database snapshots obtained from Unpaywall. Since the article was published in May 7 2019, Unpaywall has released a new data dump with up-to-date information about scholarly journal articles. Therefore it might be interesting to see how the change in data impacts the results of our analysis compared to the previous blog article. For this post we are using the Unpaywall data dump from November 2019.

In order to query the Unpaywall dataset, which is stored in Google BigQuery, we use the google-cloud-bigquery package. By default, this package is not shipped with pandas. However, we recommend to use this package along with pandas to have a better experience. The major reason for this being, that the queried data can easily be exported into a pandas DataFrame.


from google.cloud import bigquery
import pandas as pd
from matplotlib import pyplot as plt
import matplotlib.patches as mpatches
import matplotlib.ticker as mtick
import seaborn as sns
import upsetplot

The setup for the Google BigQuery Python client is very simple. Firstly, we import bigquery from the google.cloud package and call the project name on the Client method. If we didn´t set our credentials for the Google Cloud service yet, Google asks us to verify our client. This can be done by simply exporting an API-Key provided by Google into the working environment or by installing the official Google Cloud SDK. A detailed description on how to authenticate the client can be found here. Notice that our project-database has restricted access.


client = bigquery.Client(project='api-project-764811344545')

According to the aforementioned blog article, the dataset being used has two tables, containing records between 2008 and 2018. For reusability, we define two variables which are holding the table´s names to call them continuously in SQL queries. We are also using the improved string formatting syntax which is a new feature since Python 3.6.


# database snapshots
upw_08_12 = '`oadoi_full.mongo_export_upwNov19_08_12`'
upw_13_18 = '`oadoi_full.mongo_export_upwNov19_13_19`'

client.query(f"""
            SELECT * 
            FROM {upw_13_18} 
            WHERE year=2018
            AND genre="journal-article" 
            LIMIT 10
            """).to_dataframe()
#>   oa_status  ...  data_standard
#> 0      gold  ...              2
#> 1    bronze  ...              2
#> 2      gold  ...              2
#> 3      gold  ...              2
#> 4      gold  ...              2
#> 5     green  ...              2
#> 6     green  ...              2
#> 7      gold  ...              2
#> 8     green  ...              2
#> 9      gold  ...              2
#> 
#> [10 rows x 14 columns]

Open Access availability (is_oa)

To contrast the previous results from the blog article with the recent results, we begin with a comparison of the total number of articles between the two datasets from February 2019 and November 2019. More imoprtantly, we investigate the open access share between these two.


oa_08_12 = client.query(f"""
                        SELECT year, is_oa, COUNT(DISTINCT(doi)) AS n 
                        FROM {upw_08_12} 
                        WHERE genre="journal-article" 
                        GROUP BY year, is_oa
                        """).to_dataframe()

oa_13_18 = client.query(f"""
                        SELECT year, is_oa, COUNT(DISTINCT(doi)) AS n 
                        FROM {upw_13_18} 
                        WHERE year<2019 AND genre="journal-article" 
                        GROUP BY year, is_oa
                        """).to_dataframe()

df = pd.concat([oa_08_12, oa_13_18])
df.year = pd.to_datetime(df.year.apply(lambda x: str(x) + "-01-01"))
df['prop'] = df.groupby(['year'])['n'].transform(lambda x: x / x.sum())
df = df.sort_values(by=['year']).reset_index(drop=True)
df.head()
#>         year  is_oa        n      prop
#> 0 2008-01-01   True   646079  0.312379
#> 1 2008-01-01  False  1422175  0.687621
#> 2 2009-01-01  False  1528180  0.676550
#> 3 2009-01-01   True   730604  0.323450
#> 4 2010-01-01   True   823574  0.328862

For visualization we are using matplotlib. Matplotlib is a commonly used plotting library for Python that is influenced by Matlab and enables a wide range of different plot figures. Even though pandas allows us to directly call a plot method on a DataFrame, we feel more confortable with using essential matplotlib methods. Especially, when customizing the figure.


# increase plot resolution
plt.rcParams['figure.dpi'] = 600

x = df.year.unique()
y1 = df[df.is_oa == True].n
y2 = df[df.is_oa == False].n

plt.style.use('seaborn-whitegrid')
plt.rcParams['figure.figsize'] = [10, 4]
plt.grid(False, which='both', axis='x')
plt.box(False)
plt.stackplot(x, y1, y2, colors=['#56B4E9', '#b3b3b3a0'], alpha=0.8)
plt.title('Open Access to Journal Articles', fontdict={'fontsize': 15, 'fontweight': 600}, \
          pad=20)
plt.xlabel('Year published', labelpad=10, fontdict={'fontsize': 12, 'fontweight': 500})
plt.ylabel('Journal Articles', labelpad=10, fontdict={'fontsize': 12, 'fontweight': 500})
is_oa_patch = mpatches.Patch(color='#56B4E9', label='TRUE')
is_not_oa_patch = mpatches.Patch(color='#b3b3b3a0', label='FALSE')
plt.legend(handles=[is_oa_patch, is_not_oa_patch], title='Is OA?', \
           fontsize='medium', bbox_to_anchor=(1.2, 1.05), labelspacing=1.2)

plt.tight_layout()

plt.show()

As expected, the total number of journal articles has increased compared to the previous results from the blog article. In fact, the number of articles with a distinct DOI included in the Unpaywall data dump from November 2019 has increased by 2% in comparison to the February 2019 data dump used in the original blog post. In addition, the share of open access articles has increased by 6%. Overall, 13,569,137 freely available articles were published between 2008 and 2018.

Unpaywall Open Access Hosting Types (host_type)

Next, we investigate the differences between the distribution of the host types specified in the data dumps. As mentioned in the blog article, the host type variable is defined by Unpaywall’s algorithm.


HOST_TYPE_08_12_QUERY = f"""
                        SELECT year, host_type, journal_is_in_doaj, 
                            COUNT(DISTINCT(doi)) AS number_of_articles 
                        FROM {upw_08_12}, UNNEST (oa_locations) 
                        WHERE genre="journal-article" AND is_best=true 
                        GROUP BY year, host_type, journal_is_in_doaj
                        """

HOST_TYPE_13_18_QUERY = f"""
                        SELECT year, host_type, journal_is_in_doaj, 
                            COUNT(DISTINCT(doi)) AS number_of_articles 
                        FROM {upw_13_18}, UNNEST (oa_locations) 
                        WHERE genre="journal-article" AND year<2019 AND is_best=true 
                        GROUP BY year, host_type, journal_is_in_doaj
                        """

host_type_08_12_query_df = client.query(HOST_TYPE_08_12_QUERY).to_dataframe()
host_type_13_18_query_df = client.query(HOST_TYPE_13_18_QUERY).to_dataframe()

host_type_df = pd.concat([host_type_08_12_query_df, host_type_13_18_query_df])
host_type_df.year = pd.to_datetime(host_type_df.year.apply(lambda x: str(x) + "-01-01"))
host_type_df = host_type_df.sort_values(by=['year']).reset_index(drop=True)

host_type_df.loc[host_type_df['host_type'] == 'publisher', 'host'] = 'Other Journals'
host_type_df.loc[host_type_df['host_type'] == 'repository', 'host'] = 'Repositories only'
host_type_df.loc[host_type_df['journal_is_in_doaj'] == True, 'host'] = 'DOAJ-listed Journal'

host_type_df.head()
#>         year   host_type  ...  number_of_articles                 host
#> 0 2008-01-01  repository  ...              258587    Repositories only
#> 1 2008-01-01   publisher  ...               92337  DOAJ-listed Journal
#> 2 2008-01-01   publisher  ...              295155       Other Journals
#> 3 2009-01-01   publisher  ...              332639       Other Journals
#> 4 2009-01-01   publisher  ...              114857  DOAJ-listed Journal
#> 
#> [5 rows x 5 columns]

all_articles = host_type_df.groupby(['year'])['number_of_articles'].sum() \
                            .reset_index(name='number_of_articles')

x = all_articles['year'].dt.year
y_total = all_articles.number_of_articles

plt.style.use('seaborn-whitegrid')
plt.rcParams['figure.figsize'] = [11, 3.5]

fig,(axes) = plt.subplots(nrows=1, ncols=3)

fig.suptitle('Open Access to Journal Articles by Unpaywall host', fontsize=16, y=1.10)

for i, host in enumerate(host_type_df.host.unique(), 1):
    
    ax = plt.subplot(1,3,i)
    y_stacked = host_type_df[host_type_df.host==host] \
                    .groupby(['year'])['number_of_articles'].sum() \
                    .reset_index(name='number_of_articles').number_of_articles
    ax.bar(x, y_total, color='#b3b3b3a0')
    ax.bar(x, y_stacked, color='#56B4E9')
    ax.set_title(host, fontdict={'fontsize': 12, 'fontweight': 500})

    ax.set_frame_on(False)
    ax.grid(False, which='both', axis='x')

    if i > 1:
        ax.set_yticklabels([], visible=False)
        
# common xlabel   
fig.text(0.45, -0.03, 'Year', ha='center', 
         fontdict={'fontsize': 13, 'fontweight': 500})

# common ylabel
fig.text(-0.02, 0.5, 'OA Articles (Total)', va='center', rotation='vertical', 
         fontdict={'fontsize': 13, 'fontweight': 500})

by_host_patch = mpatches.Patch(color='#56B4E9', label='by Host')
all_oa_patch = mpatches.Patch(color='#b3b3b3a0', label='All OA Articles')
plt.legend(handles=[all_oa_patch, by_host_patch], fontsize='medium', \
           bbox_to_anchor=(1.05, 0.9), labelspacing=1.2)

fig.tight_layout()

plt.show()

As can be seen from the figure, the proportion of open access articles that are accessible through journals which are not listed in DOAJ has rapidly increased in the year 2018 in comparison to the previous results. Nonetheless, the overall share of articles obtained from Journals that are not listed in DOAJ has decreased from 56% to 47%. Consequently, the share of open access articles provided by DOAJ-listed journals has increased.

Unpaywall Open Access Evidence Types (evidence)

In the following, we explore varieties between the evidence types of the two data dumps. The evidence type is a variable determined by Unpaywall, which expresses the location at which an article was found and how Unpaywall was able to identify the open access status of an article.


EVIDENCE_08_12_QUERY = f"""
                        SELECT evidence, year, is_best, 
                            COUNT(distinct(doi)) AS number_of_articles 
                        FROM {upw_08_12}, UNNEST (oa_locations) 
                        WHERE genre="journal-article" 
                        GROUP BY evidence, year, is_best
                        """

EVIDENCE_13_18_QUERY = f"""
                        SELECT evidence, year, is_best, 
                            COUNT(distinct(doi)) AS number_of_articles 
                        FROM {upw_13_18}, UNNEST (oa_locations) 
                        WHERE genre="journal-article" AND year < 2019 
                        GROUP BY evidence, year, is_best
                        """

evidence_08_12 = client.query(EVIDENCE_08_12_QUERY).to_dataframe()
evidence_13_18 = client.query(EVIDENCE_13_18_QUERY).to_dataframe()

evidence_df = pd.concat([evidence_08_12, evidence_13_18])
evidence_df.year = pd.to_datetime(evidence_df.year.apply(lambda x: str(x) + "-01-01"))

evidence_df.head()
#>                                             evidence  ... number_of_articles
#> 0            oa repository (semantic scholar lookup)  ...             109096
#> 1              oa repository (via OAI-PMH doi match)  ...             103866
#> 2  oa repository (via OAI-PMH title and first aut...  ...              37305
#> 3                   oa repository (via pmcid lookup)  ...               5651
#> 4                   oa repository (via pmcid lookup)  ...             120415
#> 
#> [5 rows x 4 columns]

articles_per_type_df = evidence_df.groupby(['evidence']).number_of_articles \
                        .sum() \
                        .to_frame() \
                        .reset_index() \
                        .sort_values(by=['number_of_articles'], ascending=False) \
                        .reset_index(drop=True)
    
articles_per_type_df['prop'] = articles_per_type_df['number_of_articles'] \
                                .transform(lambda x: 
                                           x / articles_per_type_df['number_of_articles'] \
                                           .sum() * 100)

articles_per_type_df['cumul'] = articles_per_type_df.prop.cumsum()

articles_per_type_table = articles_per_type_df.copy()
articles_per_type_table.prop = articles_per_type_table.prop \
                                    .apply(lambda x: '{0:.2f}'.format(x))

articles_per_type_table.cumul = articles_per_type_table.cumul \
                                    .apply(lambda x: '{0:.2f}'.format(x))

articles_per_type_table.columns = ['Evidence Types', 
                                   'Number of Articles', 
                                   'Proportion of all Articles in %', 
                                   'Cumulative Proportion in %']

articles_per_type_table
Evidence Types Number of Articles Proportion of all Articles in % Cumulative Proportion in %
0 oa repository (semantic scholar lookup) 5759641 20.12 20.12
1 oa repository (via OAI-PMH doi match) 4467382 15.6 35.72
2 open (via free pdf) 4427452 15.47 51.19
3 open (via page says license) 3653314 12.76 63.95
4 oa journal (via doaj) 3577409 12.5 76.45
5 oa repository (via pmcid lookup) 3237547 11.31 87.75
6 oa repository (via OAI-PMH title and first author match) 1699654 5.94 93.69
7 oa journal (via observed oa rate) 781574 2.73 96.42
8 open (via crossref license) 594003 2.07 98.5
9 open (via page says Open Access) 101344 0.35 98.85
10 open (via free article) 100145 0.35 99.2
11 oa journal (via publisher name) 74618 0.26 99.46
12 open (via crossref license, author manuscript) 64929 0.23 99.69
13 oa repository (via OAI-PMH title match) 51979 0.18 99.87
14 oa repository (via OAI-PMH title and last author match) 37218 0.13 100
15 oa journal (via issn in doaj) 392 0 100
16 manual 27 0 100
17 hybrid (via page says license) 1 0 100

sum_other = articles_per_type_df \
                .loc[articles_per_type_df['prop'] < 1] \
                .number_of_articles \
                .sum()

articles_per_type_other_df = pd.DataFrame([['other', sum_other]], 
                                          columns=['evidence', 'number_of_articles'])

articles_per_type_keep_df = articles_per_type_df.loc[articles_per_type_df['prop'] > 1]

articles_per_type_grouped_df = pd.concat([articles_per_type_other_df, 
                                          articles_per_type_keep_df])

articles_per_type_grouped_df = articles_per_type_grouped_df \
                                .groupby(['evidence']) \
                                .number_of_articles.sum() \
                                .to_frame().reset_index() \
                                .sort_values(by=['number_of_articles'], ascending=False) \
                                .reset_index(drop=True)

articles_per_type_grouped_df['prop'] = articles_per_type_grouped_df['number_of_articles'] \
                        .transform(lambda x: 
                        x / articles_per_type_grouped_df['number_of_articles'].sum() * 100)

articles_per_type_grouped_df['cumul'] = articles_per_type_grouped_df.prop.cumsum()

evidence_grouped_df = evidence_df.copy()

list_of_small_evidence_types = articles_per_type_df \
                                .loc[articles_per_type_df['prop'] < 1] \
                                .evidence.tolist()

evidence_grouped_df.evidence = evidence_grouped_df \
                                .evidence.replace(list_of_small_evidence_types, 'other')

evidence_grouped_df = evidence_grouped_df.groupby(['evidence', 'is_best', 'year']) \
                                .number_of_articles.sum() \
                                .to_frame().reset_index() \
                                .sort_values(by=['number_of_articles'], ascending=False) \
                                .reset_index(drop=True)

evidence_grouped_plot_df = evidence_grouped_df \
                            .groupby(['evidence', 'is_best']) \
                            .number_of_articles.sum() \
                            .to_frame().reset_index() \
                            .sort_values(by=['number_of_articles'], ascending=True) \
                            .reset_index(drop=True)

articles_per_type_grouped_plot_df = articles_per_type_grouped_df \
                                    .sort_values(by=['number_of_articles'], ascending=True) \
                                    .reset_index(drop=True)

y1 = articles_per_type_grouped_plot_df.evidence
x_total = articles_per_type_grouped_plot_df.number_of_articles
x_stacked = evidence_grouped_plot_df[evidence_grouped_plot_df.is_best == True] \
                        .set_index('evidence') \
                        .reindex(index=articles_per_type_grouped_plot_df['evidence']) \
                        .reset_index().number_of_articles

plt.style.use('seaborn-whitegrid')
plt.rcParams['figure.figsize'] = [7, 5]
plt.grid(False, which='both', axis='y')
plt.box(False)

ax1 = plt.barh(y1, x_total, color='#b3b3b3a0')
ax2 = plt.barh(y1, x_stacked, color='#56B4E9')

plt.title('Number of Open Access Articles per Unpaywall Evidence Type', 
          fontdict={'fontsize': 15, 'fontweight': 600}, pad=20)
plt.xlabel('Number of Open Access Articles', labelpad=10, 
           fontdict={'fontsize': 13, 'fontweight': 500})

plt.ylabel('Evidence Type', labelpad=10, fontdict={'fontsize': 13, 'fontweight': 500})

is_best_patch = mpatches.Patch(color='#56B4E9', label='TRUE')
is_not_best_patch = mpatches.Patch(color='#b3b3b3a0', label='FALSE')
plt.legend(handles=[is_best_patch, is_not_best_patch], title='Is best?', 
           fontsize='medium', bbox_to_anchor=(1.2, 1.05), labelspacing=1.2)

plt.show()

The figure provides two types of evidence that were not specifically mentioned in the previous work: ‘oa repository (semantic scholar lookup)’ and ‘oa journal (via observed rate)’. Apparently these evidence types are responsible for the overall increase in articles between the two data dumps from February and November 2019.


y1 = evidence_grouped_df.sort_values(by=['year'], ascending=True) \
                        .reset_index(drop=False) \
                        ['year'].dt.year.unique()

plt.style.use('seaborn-whitegrid')

fig, (axes) = plt.subplots(nrows=5, 
                           ncols=2, 
                           sharex=True, 
                           sharey=True, 
                           figsize=(12,12))

# get a one-dimensional array
axes = axes.reshape(-1)

fig.suptitle('Unpaywall Open Access Evidence Categories per Year', fontsize=16, y=1.05)

for i, ax in enumerate(axes, 1):
    
    ax = plt.subplot(5,2,i, sharey=axes[0], sharex=axes[0])    
    
    if i % 2 == 0:
        plt.setp(ax.get_yticklabels(), visible=False)
    
    evidence = evidence_grouped_df.evidence.unique()[i-1]

    x_total = evidence_grouped_df[evidence_grouped_df.evidence == evidence] \
                            .groupby(['year']) \
                            .number_of_articles.sum() \
                            .to_frame().reset_index() \
                            .number_of_articles

    x_stacked = evidence_grouped_df[evidence_grouped_df.evidence == evidence] \
                            .loc[evidence_grouped_df.is_best == True] \
                            .groupby(['year']) \
                            .number_of_articles.sum() \
                            .to_frame().reset_index() \
                            .number_of_articles

    ax = plt.bar(y1, x_total, color='#b3b3b3a0')
    ax = plt.bar(y1, x_stacked, color='#56B4E9')
    plt.title(evidence, fontdict={'fontsize': 12, 'fontweight': 500}, pad=0.2)
    
    plt.grid(False, which='both', axis='x')
    plt.box(False)

# common xlabel   
fig.text(0.52, -0.03, 'Publication Year', ha='center', 
         fontdict={'fontsize': 13, 'fontweight': 500})

# common ylabel
fig.text(-0.04, 0.5, 'Number of Open Access Articles', va='center', rotation='vertical', 
         fontdict={'fontsize': 13, 'fontweight': 500})

is_best_patch = mpatches.Patch(color='#56B4E9', label='TRUE')
is_not_best_patch = mpatches.Patch(color='#b3b3b3a0', label='FALSE')
fig.legend(handles=[is_best_patch, is_not_best_patch], title='Is best?', 
           fontsize='large', title_fontsize='x-large', 
           bbox_to_anchor=(1.15, 0.95), labelspacing=1.2)

plt.tight_layout()

plt.show()

Overlap of Open Access Provision and Evidence Types

Owing to possible multiple associations between an article and evidence types in Unpaywall, we investigate the intersection between host types in the next step. Again, we compare the results with the previous results.


HOST_TYPE_INTERSECT_08_12_QUERY = f"""
                                    SELECT year, host_type_count, 
                                        count(distinct(doi)) as number_of_articles 
                                    FROM 
                                        (SELECT doi, year,
                                            STRING_AGG(DISTINCT(host_type) 
                                        ORDER BY host_type) 
                                        as host_type_count 
                                        FROM {upw_08_12}, UNNEST (oa_locations)
                                        WHERE genre="journal-article" 
                                        GROUP BY doi, year) 
                                    GROUP BY host_type_count, year 
                                    ORDER BY number_of_articles desc
                                    """

HOST_TYPE_INTERSECT_13_18_QUERY = f"""
                                    SELECT year, host_type_count, 
                                        count(distinct(doi)) as number_of_articles 
                                    FROM 
                                        (SELECT doi, year, 
                                        STRING_AGG(DISTINCT(host_type) 
                                        ORDER BY host_type) as host_type_count
                                        FROM {upw_13_18}, UNNEST (oa_locations) 
                                        WHERE genre="journal-article" AND year < 2019 
                                        GROUP BY doi, year) 
                                    GROUP BY host_type_count, year 
                                    ORDER BY number_of_articles desc
                                    """

host_type_08_12_intersect_df = client.query(HOST_TYPE_INTERSECT_08_12_QUERY).to_dataframe()
host_type_13_18_intersect_df = client.query(HOST_TYPE_INTERSECT_13_18_QUERY).to_dataframe()
host_type_intersect_df = pd.concat([host_type_08_12_intersect_df, 
                                    host_type_13_18_intersect_df])

host_type_intersect_df.year = pd.to_datetime(host_type_intersect_df.year \
                                             .apply(lambda x: str(x) + "-01-01"))

host_type_intersect_df \
    .loc[host_type_intersect_df['host_type_count'] == 'publisher', 'host'] = 'Publisher only'
host_type_intersect_df \
    .loc[host_type_intersect_df['host_type_count'] == 'publisher,repository', 
         'host'] = 'Publisher & Repository'
host_type_intersect_df \
    .loc[host_type_intersect_df['host_type_count'] == 'repository', 
         'host'] = 'Repositories only'

articles_total_by_year_df = df.groupby(['year']).n.sum().to_frame().reset_index()
articles_total_by_year_df.columns = ['year', 'all_articles']

host_type_intersect_df = pd.merge(articles_total_by_year_df, 
                                  host_type_intersect_df, on='year', how='right')

host_type_intersect_df = host_type_intersect_df.groupby(['year', 'host']) \
                                    .sum().eval('prop = number_of_articles/all_articles') \
                                    .reset_index()

host_type_intersect_df.head()
#>         year                    host  ...  number_of_articles      prop
#> 0 2008-01-01  Publisher & Repository  ...              184598  0.089253
#> 1 2008-01-01          Publisher only  ...              202894  0.098099
#> 2 2008-01-01       Repositories only  ...              258587  0.125027
#> 3 2009-01-01  Publisher & Repository  ...              216379  0.095794
#> 4 2009-01-01          Publisher only  ...              231117  0.102319
#> 
#> [5 rows x 5 columns]

host_type_all = host_type_intersect_df.copy()
host_type_all['prop'] = host_type_intersect_df.groupby(['year'])['prop'] \
                    .transform(lambda x: x.sum())

This time we are using the seaborn package for generating bar plots. Seaborn is a visualization library which is build on top of matplotlib. It is well designed to fit with pandas DataFrames and it also enables smoother plots. Also, we can continue to use matplotlib methods.


x = host_type_intersect_df['year'].dt.year
y = host_type_all.prop

plt.style.use('seaborn-whitegrid')
plt.rcParams['figure.figsize'] = [11, 3.5] 

fig,(axes) = plt.subplots(nrows=1, ncols=3)

fig.suptitle('Overlap between Open Access Host Types in Unpaywall', fontsize=16, y=1.10)
    
for i, host in enumerate(host_type_intersect_df.host.unique(), 1):
    ax = plt.subplot(1,3,i)
    y_stacked = host_type_intersect_df[host_type_intersect_df.host == host].prop
    sns.barplot(x, y, color='#b3b3b3a0', alpha=0.8, saturation=1, ci=None)
    sns.barplot(x, y_stacked, color='#56B4E9', alpha=1, saturation=1, ci=None)
    ax.set_title(host, fontdict={'fontsize': 12, 'fontweight': 500})
    
    ax.set_frame_on(False)
    ax.grid(False, which='both', axis='x')
    ax.set(xlabel='', ylabel='')
    for label in ax.get_xticklabels()[1::2]:
        label.set_visible(False)
        
    ax.yaxis.set_major_formatter(mtick.PercentFormatter(1.0))
    
    if i > 1:
        ax.set_yticklabels([], visible=False)
        
# common xlabel   
fig.text(0.45, -0.03, 'Year', ha='center', 
         fontdict={'fontsize': 13, 'fontweight': 500})

# common ylabel
fig.text(-0.02, 0.5, 'OA Share', va='center', rotation='vertical', 
         fontdict={'fontsize': 13, 'fontweight': 500})
    
by_host_patch = mpatches.Patch(color='#56B4E9', label='by Host')
all_oa_patch = mpatches.Patch(color='#b3b3b3a0', label='All OA Articles')
plt.legend(handles=[all_oa_patch, by_host_patch], 
           bbox_to_anchor=(1.05, 0.9), labelspacing=1.2)

plt.tight_layout()

plt.show()

The figure shows, that in terms of percentage, fewer articles were found on publisher websites than in the results of the blog article. Indeed, 73% of all open access full-texts are available through publisher websites which is a total of 9,970,245 articles. This is a decrease by almost 9% compared to the previous results. Also, the proportion of articles that are not archived in a repository has declined from 56% to 35%. Consequently, the proportion of articles that are available from both the publisher websites, as well as those that are archived in a repository have increased from 26% to 39%.


EVIDENCE_SINGLE_CAT_08_12_QUERY = f"""
                                    SELECT ev_cat, COUNT(DISTINCT(doi)) AS number_of_articles
                                    FROM 
                                        (SELECT doi, STRING_AGG(DISTINCT(evidence), "&" 
                                        ORDER BY evidence) AS ev_cat
                                        FROM {upw_08_12}, UNNEST (oa_locations)
                                        WHERE genre="journal-article"
                                        GROUP BY doi
                                        )
                                    GROUP BY ev_cat
                                   """

EVIDENCE_SINGLE_CAT_13_18_QUERY = f"""
                                    SELECT ev_cat, COUNT(DISTINCT(doi)) AS number_of_articles
                                    FROM 
                                        (SELECT doi, STRING_AGG(DISTINCT(evidence), "&" 
                                        ORDER BY evidence) AS ev_cat
                                        FROM {upw_13_18}, UNNEST (oa_locations)
                                        WHERE genre="journal-article" AND year < 2019
                                        GROUP BY doi
                                        )
                                    GROUP BY ev_cat
                                   """

evidence_categories_08_12_df = client.query(EVIDENCE_SINGLE_CAT_08_12_QUERY).to_dataframe()
evidence_categories_13_18_df = client.query(EVIDENCE_SINGLE_CAT_13_18_QUERY).to_dataframe()
evidence_categories_df = pd.concat([evidence_categories_08_12_df, 
                                    evidence_categories_13_18_df])

evidence_categories_df = evidence_categories_df.groupby(['ev_cat']) \
                                    .sum().reset_index() \
                                    .sort_values(by=['number_of_articles'], ascending=False) \
                                    .reset_index(drop=True)

evidence_categories_df.head()
#>                                               ev_cat  number_of_articles
#> 0                                open (via free pdf)             2042872
#> 1            oa repository (semantic scholar lookup)             1233239
#> 2  oa repository (semantic scholar lookup)&open (...              867351
#> 3  oa repository (via OAI-PMH doi match)&oa repos...              535673
#> 4  oa journal (via doaj)&open (via page says lice...              533464

evidence_single_cat_df = evidence_df.groupby(['evidence']) \
                                .number_of_articles.sum() \
                                .reset_index()

evidence_single_cat_df = pd.merge(evidence_single_cat_df, evidence_categories_df, 
                                 how='left', left_on=['evidence'], right_on=['ev_cat']) \
                                    .drop(['ev_cat'], axis=1) 
evidence_single_cat_df.columns = ['evidence', 'number_of_articles', 'number_of_single_cat']

evidence_single_cat_df.evidence = evidence_single_cat_df \
                                    .evidence.replace(list_of_small_evidence_types, 'other')

evidence_single_cat_grouped_df = evidence_single_cat_df.groupby(['evidence']) \
                                    .sum() \
                                    .eval('prop = number_of_single_cat/number_of_articles') \
                                    .reset_index() \
                                    .sort_values(by=['number_of_articles'], ascending=False) \
                                    .reset_index(drop=True)

evidence_single_cat_grouped_df
#>                                             evidence  ...      prop
#> 0            oa repository (semantic scholar lookup)  ...  0.214117
#> 1              oa repository (via OAI-PMH doi match)  ...  0.098287
#> 2                                open (via free pdf)  ...  0.461410
#> 3                       open (via page says license)  ...  0.126779
#> 4                              oa journal (via doaj)  ...  0.115270
#> 5                   oa repository (via pmcid lookup)  ...  0.033182
#> 6  oa repository (via OAI-PMH title and first aut...  ...  0.219231
#> 7                  oa journal (via observed oa rate)  ...  0.299300
#> 8                        open (via crossref license)  ...  0.460794
#> 9                                              other  ...  0.417280
#> 
#> [10 rows x 4 columns]

y1 = evidence_single_cat_grouped_df \
                .sort_values(by=['number_of_articles'], ascending=True) \
                .reset_index(drop=True).evidence

x_total = evidence_single_cat_grouped_df.groupby(['evidence']) \
                .number_of_articles.count() \
                .to_frame().reset_index() \
                .sort_values(by=['number_of_articles'], ascending=True) \
                .reset_index(drop=True) \
                .number_of_articles


x_stacked = evidence_single_cat_grouped_df \
                .sort_values(by=['number_of_articles'], ascending=True) \
                .reset_index(drop=True) \
                .prop

fig, ax = plt.subplots(figsize=(7,5))

plt.style.use('seaborn-whitegrid')
#plt.rcParams['figure.figsize'] = [7, 5]
plt.grid(False, which='both', axis='y')
plt.box(False)

ax1 = plt.barh(y1, x_total, color='#b3b3b3a0')
ax2 = plt.barh(y1, x_stacked, color='#56B4E9')

ax.xaxis.set_major_formatter(mtick.PercentFormatter(1.0))

plt.title('Proportion of Articles per Evidence Type', 
          fontdict={'fontsize': 15, 'fontweight': 600}, pad=20)
plt.xlabel('Proportion of Articles', labelpad=10, 
           fontdict={'fontsize': 13, 'fontweight': 500})

plt.ylabel('Evidence Type', labelpad=10, fontdict={'fontsize': 13, 'fontweight': 500})

is_unique_patch = mpatches.Patch(color='#56B4E9', label='TRUE')
is_not_unique_patch = mpatches.Patch(color='#b3b3b3a0', label='FALSE')
plt.legend(handles=[is_unique_patch, is_not_unique_patch], title='Is unique?', 
           fontsize='medium', bbox_to_anchor=(1.2, 1.05), labelspacing=1.2)

plt.show()

To visualize the intersection between multiple evidence types, we are using the UpSetPlot package which is in contrast to py-upset better maintained. It offers better support for pandas and is also well documented. Especially the input format for the UpSet plot is well described.

The input format can be generated with the from_memberships method. It accepts two parameters: a nested list with elements corresponding to a set and secondly a list containing additional data which has the same length as the nested list.


evidence_categories_upset_df = evidence_categories_df.groupby(['ev_cat']) \
                                        .sum().reset_index() \
                                        .sort_values(by=['number_of_articles'], 
                                                     ascending=False) \
                                        .reset_index(drop=True)

# subset of fifteen most frequent evidence types
evidence_categories_upset_most_frequent = evidence_categories_upset_df[:15]

ev_list = evidence_categories_upset_most_frequent.ev_cat.tolist()
ev_list = [ev.split('&') for ev in ev_list]
n_list = evidence_categories_upset_most_frequent.number_of_articles.tolist()
evidence_categories_upset_expr = upsetplot.from_memberships(ev_list, data=n_list)

fig = plt.figure(figsize=(10,6))

axes = upsetplot.UpSet(evidence_categories_upset_expr, 
                sort_by='cardinality',
                sort_categories_by='cardinality',
                element_size=20, 
                intersection_plot_elements=15, 
                totals_plot_elements=7
               ).plot(fig=fig)

ax1 = axes['intersections']
ax1.yaxis.grid(False)

ax2 = axes['shading']

ax3 = axes['matrix']

ax4 = axes['totals']
ax4.xaxis.grid(False)

# reduce overlap with text
plt.subplots_adjust(left=0, bottom=0, right=1.1, top=1, wspace=0, hspace=0)

fig.text(0.1, -0.07, 'Set Size', ha='center', 
         fontdict={'fontsize': 10, 'fontweight': 500})

# delete shading axis for better readability
fig.delaxes(ax2)
    
plt.show()

Discussion and Conclusion

In this blog post, we provided an updated analysis of open access evidences in Unpaywall by examining database snapshots obtained from Unpaywall. Using Python, we were able to find 13,569,137 scholarly articles in Unpaywall in the period from 2008 to 2018 that are freely available. Fortunately, we can compare these results with the previous database snapshot from February 2019 which were described in the mentioned blog article. Thus we can interpret upcoming trends more precisely.

Based on the previous results, our analysis exposes an increase of open access full-texts by 6% from 37% to 43%. Furthermore, we were able to identify new evidence types that were introduced in the latest database snapshot. Altogether, this suggests that more articles left closed access and Unpaywall was able to link more DOI’s to articles found in the new introduced evidence types.

Additionally, this work demonstrates similarities between data analysis in Python and R. Due to the fact that this work is based on tranlated R source code, we learned plenty of approaches to face upcoming data analysis projects.